{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 多GPU计算的简洁实现\n",
    "\n",
    "在Gluon中，我们可以很方便地使用数据并行进行多GPU计算。例如，我们并不需要自己实现[“多GPU计算”](multiple-gpus.ipynb)一节里介绍的多GPU之间同步数据的辅助函数。\n",
    "\n",
    "首先导入本节实验所需的包或模块。运行本节中的程序需要至少2块GPU。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "1"
    }
   },
   "outputs": [],
   "source": [
    "import d2lzh as d2l\n",
    "import mxnet as mx\n",
    "from mxnet import autograd, gluon, init, nd\n",
    "from mxnet.gluon import loss as gloss, nn, utils as gutils\n",
    "import time"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 多GPU上初始化模型参数\n",
    "\n",
    "我们使用ResNet-18作为本节的样例模型。由于本节的输入图像使用原尺寸（未放大），这里的模型构造与[“残差网络（ResNet）”](../chapter_convolutional-neural-networks/resnet.ipynb)一节中的ResNet-18构造稍有不同。这里的模型在一开始使用了较小的卷积核、步幅和填充，并去掉了最大池化层。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "2"
    }
   },
   "outputs": [],
   "source": [
    "def resnet18(num_classes):  # 本函数已保存在d2lzh包中方便以后使用\n",
    "    def resnet_block(num_channels, num_residuals, first_block=False):\n",
    "        blk = nn.Sequential()\n",
    "        for i in range(num_residuals):\n",
    "            if i == 0 and not first_block:\n",
    "                blk.add(d2l.Residual(\n",
    "                    num_channels, use_1x1conv=True, strides=2))\n",
    "            else:\n",
    "                blk.add(d2l.Residual(num_channels))\n",
    "        return blk\n",
    "\n",
    "    net = nn.Sequential()\n",
    "    # 这里使用了较小的卷积核、步幅和填充，并去掉了最大池化层\n",
    "    net.add(nn.Conv2D(64, kernel_size=3, strides=1, padding=1),\n",
    "            nn.BatchNorm(), nn.Activation('relu'))\n",
    "    net.add(resnet_block(64, 2, first_block=True),\n",
    "            resnet_block(128, 2),\n",
    "            resnet_block(256, 2),\n",
    "            resnet_block(512, 2))\n",
    "    net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))\n",
    "    return net\n",
    "\n",
    "net = resnet18(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "之前我们介绍了如何使用`initialize`函数的`ctx`参数在内存或单块显卡的显存上初始化模型参数。事实上，`ctx`可以接受一系列的CPU及内存和GPU及相应的显存，从而使初始化好的模型参数复制到`ctx`里所有的内存和显存上。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "3"
    }
   },
   "outputs": [],
   "source": [
    "ctx = [mx.gpu(0), mx.gpu(1)]\n",
    "net.initialize(init=init.Normal(sigma=0.01), ctx=ctx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Gluon提供了上一节中实现的`split_and_load`函数。它可以划分一个小批量的数据样本并复制到各个内存或显存上。之后，根据输入数据所在的内存或显存，模型计算会相应地使用CPU或相同显卡上的GPU。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "4"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(\n",
       " [[ 5.48149364e-06 -8.33710089e-07 -1.63167692e-06 -6.36740765e-07\n",
       "   -3.82161761e-06 -2.35140669e-06 -2.54695851e-06 -9.47824219e-08\n",
       "   -6.90335582e-07  2.57562374e-06]\n",
       "  [ 5.47108630e-06 -9.42463600e-07 -1.04940591e-06  9.80820687e-08\n",
       "   -3.32518266e-06 -2.48629135e-06 -3.36428002e-06  1.04560286e-07\n",
       "   -6.10012194e-07  2.03278501e-06]]\n",
       " <NDArray 2x10 @gpu(0)>, \n",
       " [[ 5.6176350e-06 -1.2837600e-06 -1.4605525e-06  1.8302978e-07\n",
       "   -3.5511648e-06 -2.4371018e-06 -3.5731791e-06 -3.0974837e-07\n",
       "   -1.1016566e-06  1.8909888e-06]\n",
       "  [ 5.1418701e-06 -1.3729926e-06 -1.1520079e-06  1.1507450e-07\n",
       "   -3.7372806e-06 -2.8289705e-06 -3.6477188e-06  1.5781586e-07\n",
       "   -6.0733169e-07  1.9712008e-06]]\n",
       " <NDArray 2x10 @gpu(1)>)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = nd.random.uniform(shape=(4, 1, 28, 28))\n",
    "gpu_x = gutils.split_and_load(x, ctx)\n",
    "net(gpu_x[0]), net(gpu_x[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，我们可以访问已初始化好的模型参数值了。需要注意的是，默认情况下`weight.data()`会返回内存上的参数值。因为我们指定了2块GPU来初始化模型参数，所以需要指定显存来访问参数值。我们看到，相同参数在不同显卡的显存上的值一样。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "5"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "not initialized on cpu(0)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(\n",
       " [[[-0.01473444 -0.01073093 -0.01042483]\n",
       "   [-0.01327885 -0.01474966 -0.00524142]\n",
       "   [ 0.01266256  0.00895064 -0.00601594]]]\n",
       " <NDArray 1x3x3 @gpu(0)>, \n",
       " [[[-0.01473444 -0.01073093 -0.01042483]\n",
       "   [-0.01327885 -0.01474966 -0.00524142]\n",
       "   [ 0.01266256  0.00895064 -0.00601594]]]\n",
       " <NDArray 1x3x3 @gpu(1)>)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weight = net[0].params.get('weight')\n",
    "\n",
    "try:\n",
    "    weight.data()\n",
    "except RuntimeError:\n",
    "    print('not initialized on', mx.cpu())\n",
    "weight.data(ctx[0])[0], weight.data(ctx[1])[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 多GPU训练模型\n",
    "\n",
    "当使用多块GPU来训练模型时，`Trainer`实例会自动做数据并行，例如，划分小批量数据样本并复制到各块显卡的显存上，以及对各块显卡的显存上的梯度求和再广播到所有显存上。这样，我们就可以很方便地实现训练函数了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "7"
    }
   },
   "outputs": [],
   "source": [
    "def train(num_gpus, batch_size, lr):\n",
    "    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)\n",
    "    ctx = [mx.gpu(i) for i in range(num_gpus)]\n",
    "    print('running on:', ctx)\n",
    "    net.initialize(init=init.Normal(sigma=0.01), ctx=ctx, force_reinit=True)\n",
    "    trainer = gluon.Trainer(\n",
    "        net.collect_params(), 'sgd', {'learning_rate': lr})\n",
    "    loss = gloss.SoftmaxCrossEntropyLoss()\n",
    "    for epoch in range(4):\n",
    "        start = time.time()\n",
    "        for X, y in train_iter:\n",
    "            gpu_Xs = gutils.split_and_load(X, ctx)\n",
    "            gpu_ys = gutils.split_and_load(y, ctx)\n",
    "            with autograd.record():\n",
    "                ls = [loss(net(gpu_X), gpu_y)\n",
    "                      for gpu_X, gpu_y in zip(gpu_Xs, gpu_ys)]\n",
    "            for l in ls:\n",
    "                l.backward()\n",
    "            trainer.step(batch_size)\n",
    "        nd.waitall()\n",
    "        train_time = time.time() - start\n",
    "        test_acc = d2l.evaluate_accuracy(test_iter, net, ctx[0])\n",
    "        print('epoch %d, time %.1f sec, test acc %.2f' % (\n",
    "            epoch + 1, train_time, test_acc))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先在单GPU上训练模型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "running on: [gpu(0)]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 1, time 14.0 sec, test acc 0.87\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 2, time 13.0 sec, test acc 0.90\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 3, time 13.0 sec, test acc 0.92\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 4, time 13.0 sec, test acc 0.91\n"
     ]
    }
   ],
   "source": [
    "train(num_gpus=1, batch_size=256, lr=0.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后尝试在2块GPU上训练模型。与上一节使用的LeNet相比，ResNet-18的计算更加复杂，通信时间比计算时间更短，因此ResNet-18的并行计算所获得的性能提升更佳。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "10"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "running on: [gpu(0), gpu(1)]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 1, time 7.3 sec, test acc 0.79\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 2, time 6.7 sec, test acc 0.81\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 3, time 6.7 sec, test acc 0.84\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 4, time 6.7 sec, test acc 0.89\n"
     ]
    }
   ],
   "source": [
    "train(num_gpus=2, batch_size=512, lr=0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 小结\n",
    "\n",
    "* 在Gluon中，可以很方便地进行多GPU计算，例如，在多GPU及相应的显存上初始化模型参数和训练模型。\n",
    "\n",
    "\n",
    "## 练习\n",
    "\n",
    "* 本节使用了ResNet-18模型。试试不同的迭代周期、批量大小和学习率。如果条件允许，使用更多GPU来计算。\n",
    "* 有时候，不同设备的计算能力不一样，例如，同时使用CPU和GPU，或者不同GPU之间型号不一样。这时候，应该如何将小批量划分到内存或不同显卡的显存？\n",
    "\n",
    "\n",
    "\n",
    "## 扫码直达[讨论区](https://discuss.gluon.ai/t/topic/1885)\n",
    "\n",
    "![](../img/qr_multiple-gpus-gluon.svg)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}